I chose the GoBike dataset to investigate bike hiring trends in SF. The main focus was on trip duration, start time, and season, with the aim of drawing my own insights.
The data consists of 16 variables, such as age, gender, weekday, and time. It contains 3.31 billion rides. Users aged 18 to 56 make up about 95% of the dataset. Some users are listed as more than 100 years old, so we can remove users older than 60.
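The age check and the cutoff at 60 can be sketched as follows. This is a minimal illustration on toy data, not the notebook's own cleaning step; the column name `member_age` and the frame name `rides` are assumptions.

```python
import pandas as pd

# Toy stand-in for the real data; "member_age" is an assumed column name.
rides = pd.DataFrame({"member_age": [18, 25, 31, 40, 56, 62, 101, 119]})

# Central 95% age range, taken from the 2.5th and 97.5th percentiles.
low, high = rides["member_age"].quantile([0.025, 0.975])

# Keep riders aged 60 or younger, as proposed above.
MAX_AGE = 60
rides_clean = rides[rides["member_age"] <= MAX_AGE].reset_index(drop=True)

print(f"95% of ages fall between {low:.1f} and {high:.1f}")
print(f"kept {len(rides_clean)} of {len(rides)} rows")
```

On the real data, the two percentiles give a quick way to confirm a range such as 18 to 56 before choosing the cutoff.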
The data contains varied and interesting information, from age and gender to trip duration and trip start and end times. I extracted only 10% of the dataset rather than all of it because of memory constraints.
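The text does not say how the 10% subset was drawn. One common option is a reproducible random sample of rows, sketched here on toy data; the frame name `full` and the seed value are assumptions.

```python
import pandas as pd

# Toy stand-in for the full trip table.
full = pd.DataFrame({"duration_sec": range(1000)})

# Draw a reproducible 10% random sample of rows; the seed is arbitrary.
sample = full.sample(frac=0.10, random_state=42)

print(len(sample), "of", len(full), "rows kept")
```

Fixing `random_state` means the same rows are selected on every run, which keeps the plots comparable between sessions.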
import plotly.graph_objects as go

# Plot the distribution of trip duration (log10 of the duration in minutes).
data = go.Histogram(x=df["duration_min_log"])
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid": False, "title": "log10(duration in minutes)"},
    yaxis={"showgrid": False, "title": "Frequency"},
)
fig = go.Figure(data, layout)
fig.show()
# Present the five-number summary.
data = go.Box(y=df["duration_min_log"], name="Trip Duration")
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid": False},
    yaxis={"showgrid": False},
)
fig = go.Figure(data, layout)
fig.show()
Conclusion 1: As seen before, the plot of trip duration in seconds is hard to read, so I applied a base-10 log transformation. This gives a roughly normal-shaped distribution and makes the question easier to answer. Most trips last about 10 minutes on average, which means they are short trips.
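The transformation described in Conclusion 1 can be sketched as follows, on toy data. The column `duration_sec` and the frame name `df_demo` are assumptions; `duration_min` and `duration_min_log` follow the names used in the plots.

```python
import numpy as np
import pandas as pd

# Toy durations in seconds; "duration_sec" is an assumed column name.
df_demo = pd.DataFrame({"duration_sec": [120, 300, 600, 1200, 86000]})

# Convert to minutes, then take the base-10 logarithm.
df_demo["duration_min"] = df_demo["duration_sec"] / 60
df_demo["duration_min_log"] = np.log10(df_demo["duration_min"])

# Back-transform the centre of the log distribution to read it in minutes.
typical_min = 10 ** df_demo["duration_min_log"].median()
print(round(typical_min, 2))
```

A value of 1 on the log axis corresponds to 10 minutes and a value of 2 to 100 minutes, which is how the peak of the histogram can be read back as a typical trip length.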
# Median trip duration per season
season_duration_median = df.groupby("season")["duration_min"].median().reset_index()
fig = go.Figure(
    go.Bar(
        x=season_duration_median["season"].tolist(),
        y=season_duration_median["duration_min"].tolist(),
        text=season_duration_median["duration_min"].round(2).astype(str).tolist(),
        textposition="auto",
    ),
    go.Layout(
        title="Median trip duration per season in minutes",
        xaxis={"showgrid": False, "title": "Season"},
        yaxis={"showgrid": False, "title": "Median trip duration (min)"},
    ),
)
fig.show()
Conclusion 2: Because this data contains many outliers, I measured the average with the median rather than the mean so that the results are not misleading. Although trip duration does not differ much across seasons, the plot shows that spring has the longest median trip duration. I expected this, since the relaxing spring weather encourages cycling. Overall, weather does not seem to affect trip duration much in SF. I am not sure why, but it may be because the weather there rarely swings to extremes.
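The reason for preferring the median can be seen in a small example. The numbers below are made up for illustration: five ordinary trips and one bike kept for almost a full day.

```python
import pandas as pd

# Toy trip durations in minutes with one extreme outlier.
durations = pd.Series([8, 9, 10, 11, 12, 1400])

print("mean:  ", round(durations.mean(), 2))  # pulled far up by the outlier
print("median:", durations.median())          # stays near the typical trip
```

A single long rental moves the mean well away from any trip that actually occurred, while the median stays between the two middle values.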
# Heatmap of log trip duration (colour) by month and season
fig = go.Figure(
    go.Heatmap(
        z=df["duration_min_log"].tolist(),
        x=df["start_month"].tolist(),
        y=df["season"].tolist(),
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and months across the year",
        xaxis={"showgrid": False, "title": "Month"},
        yaxis={"showgrid": False, "title": "Season"},
        xaxis_type="category",
    ),
)
fig.show()
# Heatmap of log trip duration (colour) by day of week and season
fig = go.Figure(
    go.Heatmap(
        z=df["duration_min_log"].tolist(),
        x=df["start_day"].tolist(),
        y=df["season"].tolist(),
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and days across the week",
        xaxis={"showgrid": False, "title": "Day of week"},
        yaxis={"showgrid": False, "title": "Season"},
        xaxis_type="category",
    ),
)
fig.show()
Conclusion 3: I created a season column to plot a multivariate exploration of season, month, and trip duration. The first heatmap above shows that the longest trip durations occur in summer, specifically in August. Winter, in Sep and Jan, comes second, while the spring months have the shortest trip durations of the year. Unlike the earlier bar chart, where spring has the longest median trip duration, the heatmap points to summer as having the longest trips. The second heatmap supports the same finding: summer has the peak trip durations, most often on Wednesdays.
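One way to cross-check the disagreement between the bar chart and the heatmaps is to aggregate the trips to a single median per season and month before plotting. The sketch below uses toy data. The month-to-season mapping is hypothetical (meteorological seasons), since the notebook's own mapping is not shown and may differ; the names `SEASON_BY_MONTH`, `trips`, and `grid` are illustrative.

```python
import pandas as pd

# Hypothetical month-to-season mapping; the notebook's own mapping may differ.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

# Toy trips: start month and duration in minutes.
trips = pd.DataFrame({
    "start_month": [1, 1, 4, 4, 8, 8, 8, 10],
    "duration_min": [9.0, 11.0, 12.0, 14.0, 8.0, 10.0, 30.0, 7.0],
})
trips["season"] = trips["start_month"].map(SEASON_BY_MONTH)

# One median per (season, month) cell; combinations with no trips become NaN.
grid = trips.pivot_table(
    index="season", columns="start_month",
    values="duration_min", aggfunc="median",
)
print(grid)
```

The resulting table can be passed to `go.Heatmap` as `z=grid.values`, `x=grid.columns`, and `y=grid.index`, so that each cell shows one summary value instead of individual trips and the result is directly comparable with the median bar chart.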